Skills: Describing data relationships
Week 1
This page illustrates some skills that will be useful to you as you learn to describe variables in a dataset and the relationships among them.
Generating simulated data
In practice, generating simulated (i.e. fake) data is generally not going to yield useful insights about the real world, but it can be a good way to confirm that your methods doing what you think they’re doing.
Uniformly-distributed continuous variables
A uniformly-distributed variable is one where all values within a range are equally likely.
R
In R, you can generate a uniform random variable using the
runif() function. It takes three arguments:
n: The number of values to generatemin: The lower end of the range of possible valuesmax: The upper end of the range of possible values
Here is how you would generate 1000 numbers that are uniformly distributed between 13 and 87.
The default values for min and max are zero
and one, so if you want random values between zero and one, you can
leave those arguments out.
Excel
The equivalent function in Excel is RANDBETWEEN(). It
takes two arguments arguments:
bottom: The lower end of the range of possible valuestop: The upper end of the range of possible values
A related function that will generate a random number from a uniform
distribution between zero and one is RAND()
Normally-distributed continuous variable
Many real-world quantities are normally distributed within a population. In a normal distribution, values close to the average values are the most likely, and values futher from the average (whether lower or higher) are less likely.
A normal distribution can be described in terms of its mean and standard deviation.
R
In R, you can use the rnorm() function to generate a set
of random numbers that are normally distributed. The function takes
three arguments:
n: The number of values you want to generate.mean: The average value of the distribution you want to draw from.sd: The standard deviation of the distribution you want to draw from.
Here is how you would generate a set of 100 random numbers from a normal distribution with a mean of 150 and a standard deviation of 30.
Excel
In Excel, the function NORM.INV will return a value that
a random variable from normal distribution with a specified mean and
standard deviation has a specified probability of being less than. If
the specified probability is a random number between zero and one, this
will be equivalent to the rnorm() function in R.
Dependent categorical variable
You can assign values to a categorical variable based on the values of a continuous variable. If you want values with specified probabilities of being in each of multiple categories, you can use a uniformly-distributed categorical variable as a starting point.
For example, if you want 30 percent of your observations to be in category A, 50 percent of your observations to be in category B, and the remaining 20 percent to be in category C, You could generate a random uniform variable and assign category A to values less than 0.3, category B to the remaining values that are less than 0.8, and category C to everything else.
R
The case_when() function offers an easy way to do that
in R. It will work best if you organize your variables into a tibble
first.
library(tidyverse)
my_data <- tibble(uniform = my_uniform,
zero_to_one = my_zero_to_one,
normal = my_normal) |>
mutate(categories = case_when(zero_to_one < 0.3 ~ "A",
zero_to_one < 0.8 ~ "B",
TRUE ~ "C"))Here are the first few rows of the resulting tibble:
| uniform | zero_to_one | normal | categories |
|---|---|---|---|
| 41.56931 | 0.3944369 | 127.7646 | B |
| 76.12677 | 0.3218309 | 162.7315 | B |
| 36.36313 | 0.2904716 | 202.7221 | A |
| 78.64203 | 0.1432587 | 140.0582 | A |
| 13.56316 | 0.7662607 | 214.0131 | B |
| 63.44494 | 0.4750678 | 126.5300 | B |
I can also create a separate column with a 1 (true) or 0 (false) value for each of those three categories.
my_data <- my_data |>
mutate(A = ifelse(categories == "A", 1, 0),
B = ifelse(categories == "B", 1, 0),
C = ifelse(categories == "C", 1, 0))
my_data |>
head() |>
kable()| uniform | zero_to_one | normal | categories | A | B | C |
|---|---|---|---|---|---|---|
| 41.56931 | 0.3944369 | 127.7646 | B | 0 | 1 | 0 |
| 76.12677 | 0.3218309 | 162.7315 | B | 0 | 1 | 0 |
| 36.36313 | 0.2904716 | 202.7221 | A | 1 | 0 | 0 |
| 78.64203 | 0.1432587 | 140.0582 | A | 1 | 0 | 0 |
| 13.56316 | 0.7662607 | 214.0131 | B | 0 | 1 | 0 |
| 63.44494 | 0.4750678 | 126.5300 | B | 0 | 1 | 0 |
Excel
In Excel, you can do something similar with nested IF
functions. An IF function in Excel takes three
functions:
- A logical statement that is either true or false
- The value to return if the statement is true
- The value to return if the statement is false
If the final argument (the value to return if the statement is false)
is another IF statement, then you are nesting one
IF statement inside another.
Remember to always lock your cell references before pasting a formula down a column (this is an error in the example below).
And you can also use if statements to create a separate column with a 1/0 (true/false) value for each category.
Dependent continuous variable
When you generate multiple random variables, they will be independent from one another. You can also generate a variable that depends on the value of other variables.
In the examples below, I’ll create a new variable that has a linear relationship with the normally-distributed variable and the categorical variable I just created. For observations in category A, the value of the new variable will be ninety percent of the value of the independent normal variable I previously generated. Observations in category B will be five points higher than if they were in category A, and observations in category C will be 10 points lower than if they were in category A.
R
Here is how I would calculate that new value in R.
| uniform | zero_to_one | normal | categories | A | B | C | outcome |
|---|---|---|---|---|---|---|---|
| 41.56931 | 0.3944369 | 127.7646 | B | 0 | 1 | 0 | 119.9881 |
| 76.12677 | 0.3218309 | 162.7315 | B | 0 | 1 | 0 | 151.4584 |
| 36.36313 | 0.2904716 | 202.7221 | A | 1 | 0 | 0 | 182.4499 |
| 78.64203 | 0.1432587 | 140.0582 | A | 1 | 0 | 0 | 126.0524 |
| 13.56316 | 0.7662607 | 214.0131 | B | 0 | 1 | 0 | 197.6118 |
| 63.44494 | 0.4750678 | 126.5300 | B | 0 | 1 | 0 | 118.8770 |
Excel
And here is how I would do it in Excel.
Visualizing data
Some common graphics to visualize data include histograms, scatter plots, tree maps, and violin plots.
Histogram
A histogram is a useful way to visualize the distubution of a continuous variable.
R
In R (ggplot2), geom_hist will generate a histogram.
Here, I’ll check to see whether the random normal variable I generated really does approximate a normal distribution.
Excel
You can create a histogram in Excel using the data analysis tool pack. By default, it will put gaps between the bars. It is more common to show histograms without these gaps, so you can remove them.
Scatter plot (continuous-continuous)
A scatter plot is commonly used to show the relationship between two continuous variables.
R
You can use geom_point to generate a scatter plot in
R.
Excel
You can create a scatter plot in Excel using the “Insert Chart” button.
Looking through the examples so far, how does the scatter plot I just created in Excel look different from the one I created in R? Why the difference?1
Scatter plot (continuous-categorical)
You can also use a scatter plot to show the relationship between a continuous variable and a categorical variable.
R
In R, you can use set a categorical variable for either the x axis or
the y axis, and set the parameter position = "jitter" to
add random variation within the category positions.
ggplot(my_data) +
geom_point(aes(x = categories,
y = outcome),
position = "jitter") +
theme_linedraw()Excel
You can acheive the same effect by adding that random variation manually in Excel. You would create a variable to represent the positions of points along the x-axis.
Then you can create a scatter plot as usual.
Descriptive statistics
You may want to describe your data with descriptive statistics.
Mean
The mean is the average value of a variable. It is a measure of central tendency that is sensitive to outliers.
R
The function in R that will calculate a mean is called
mean().
## [1] 136.0257
You can do some fancy data reshaping to get the means of multiple variables in a neat little table.
data_summary <- my_data |>
pivot_longer(cols = c("normal", "outcome"),
names_to = "Variable",
values_to = "value") |>
group_by(Variable) |>
summarize(Average = mean(value))
kable(data_summary, format.args = list(digits = 4))| Variable | Average |
|---|---|
| normal | 150.4 |
| outcome | 136.0 |
Excel
In Excel, the function is AVERAGE().
Standard deviation
The standard deviation is a measure of how spread out the data are. It is sensitive to outliers.
R
In R, the function sd() will calculate a standard
deviation.
## [1] 29.99083
Again, you can get this into a neat little table with some data carpentry magic.
data_summary <- my_data |>
pivot_longer(cols = c("normal", "outcome"),
names_to = "Variable",
values_to = "value") |>
group_by(Variable) |>
summarize(Average = mean(value),
`Standard deviation` = sd(value))
kable(data_summary, format.args = list(digits = 4))| Variable | Average | Standard deviation |
|---|---|---|
| normal | 150.4 | 29.99 |
| outcome | 136.0 | 27.61 |
Excel
In Excel, the function to calculate a standard deviation is
STDEV.P().
Median and other percentiles
Percentiles are the values that a specified percentage of observations of a variable are lower than. The 50th percentile is also called the median: It’s the value that half of all observations are less than.
The median is measure of central tendency that is not sensitive to outliers.
The range between lower and upper percentiles can be a useful measure of central tendency. For example, the interquartile range (IQR) is the range between the 25th and 75th percentile. Half of all values fall within this range.
R
In R, you can calculate a percentile with the function
quantile()
## 50%
## 150.3205
pctile_75 = quantile(my_data$normal, probs = 0.75)
pctile_25 = quantile(my_data$normal, probs = 0.25)
IQR = pctile_75 - pctile_25
as.numeric(IQR)## [1] 40.41127
And again you can put those in a neat little table.
data_summary <- my_data |>
pivot_longer(cols = c("normal", "outcome"),
names_to = "Variable",
values_to = "value") |>
group_by(Variable) |>
summarize(Average = mean(value),
`Standard deviation` = sd(value),
Median = quantile(value, probs = 0.5),
IQR = quantile(value, probs = 0.75) - quantile(value, probs = 0.25))
kable(data_summary, format.args = list(digits = 4))| Variable | Average | Standard deviation | Median | IQR |
|---|---|---|---|---|
| normal | 150.4 | 29.99 | 150.3 | 40.41 |
| outcome | 136.0 | 27.61 | 135.8 | 37.16 |
Excel
In Excel, the function to calculate percentiles is
PERCENTILE.EXC().
Proportions
If you’ve set up a column for each category, where a value of 1 means the observation is in that category and a value of zero means it is not, the average of one of those columns is the proportion of observations within that category.
Use the method for calculating an average in either R or Excel.
Regression
We calculated the outcome variable as a function of the normally-disributed variable and the category:
\[ outcome = (0.9)normal + (5)B - (10)C \]
If we didn’t know what equation was used to generate the outcome variable, we could reverse-engineer it using regression.
R
In R, we can estimate a linear regression model using the function
lm().
##
## Call:
## lm(formula = outcome ~ normal + B + C, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.605e-11 -9.000e-16 6.200e-15 1.350e-14 4.364e-13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.572e-14 2.836e-14 -9.070e-01 0.364
## normal 9.000e-01 1.769e-16 5.088e+15 <2e-16 ***
## B 5.000e+00 1.226e-14 4.079e+14 <2e-16 ***
## C -1.000e+01 1.555e-14 -6.429e+14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.75e-13 on 4996 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 9.031e+30 on 3 and 4996 DF, p-value: < 2.2e-16
Excel
In Excel, we can estimate a regression model using the regression tool in the data analysis tool pack.
Notice that the regression output are static values rather than formulas. The values will not update when the data used to generate them updates.
The answer is that both the normally-distributed variable and the categorical variable were calculated from the same uniformly-distributed variable, so they are not independent from one another. This is an error (assuming these are meant to represent independent variables).↩︎